74 research outputs found

    Preserving high-level semantics of parallel programming annotations through the compilation flow of optimizing compilers

    Get PDF
    This paper presents a technique for representing the high-level semantics of parallel programming languages in the intermediate representation of optimizing compilers. The semantics of these languages does not fit well in the intermediate representation of classical optimizing compilers, which is designed for single-threaded applications, and is usually lowered to threaded code with opaque concurrency bindings through source-to-source compilation or a front-end compiler pass. As a result, the semantic properties of the high-level parallel language are obfuscated at a very early stage of the compilation flow, which is detrimental to the effectiveness of downstream optimizations. We define the properties we introduce in this representation and prove that they are preserved by existing optimization passes. We characterize the optimizations that are enabled by or interfere with this representation, and we evaluate the impact of the serial optimizations enabled by this technique for concurrent programs, using a prototype implemented in a branch of GCC 4.6. While we focus on the OpenMP language as a running example, we also analyze how our semantic abstraction can support the unification of analyses and optimizations for a variety of parallel programming languages.
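
    To make the issue concrete, here is a hedged illustration (the code is ours, not from the paper): a simple OpenMP loop that classical compilers outline into a separate function and replace with opaque runtime calls (in GCC's case, e.g., GOMP_parallel_start), at which point later scalar optimizers no longer see the loop.

        #include <stddef.h>

        /* After early lowering, the loop body below typically becomes a
         * separate outlined function invoked through libgomp, so
         * downstream passes can no longer reason about the loop itself. */
        void scale(float *restrict a, const float *restrict b,
                   float c, size_t n)
        {
            #pragma omp parallel for
            for (size_t i = 0; i < n; i++)
                a[i] = c * b[i];
        }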

    A Stream-Computing Extension to OpenMP

    Get PDF
    This paper introduces an extension to OpenMP 3.0 enabling stream programming with minimal, incremental additions that integrate seamlessly into the current specification. The stream programming model decomposes programs into tasks and makes the flow of data among them explicit, thereby exposing data, task and pipeline parallelism. It helps programmers express concurrency and data locality properties while avoiding non-portable low-level code and premature optimization. We survey the diverse motivations and constraints converging towards the design of our simple yet powerful language extension, and we present experimental results from a prototype implementation in a public branch of GCC 4.5.
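
    As a hedged sketch of the flavor of annotation such an extension enables (the clause names below are illustrative, not standard OpenMP 3.0), tasks can be connected by a variable acting as a stream:

        int x;                 /* conceptually a stream in the extended model */
        int  produce(int i);   /* hypothetical producer function */
        void consume(int v);   /* hypothetical consumer function */

        void pipeline(int n)
        {
            for (int i = 0; i < n; i++) {
                #pragma omp task output(x)   /* producer stage */
                x = produce(i);

                #pragma omp task input(x)    /* consumer stage */
                consume(x);
            }
        }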

    Advances in Parallel-Stage Decoupled Software Pipelining Leveraging Loop Distribution, Stream-Computing and the SSA Form

    No full text
    Decoupled Software Pipelining (DSWP) is a program partitioning method enabling compilers to extract pipeline parallelism from sequential programs. Parallel-Stage DSWP (PS-DSWP) is an extension that also exploits the data parallelism within pipeline filters. This paper presents the preliminary design of a new PS-DSWP method capable of handling arbitrary structured control flow, with slightly better algorithmic complexity, natural exploitation of nested parallelism with communications across arbitrary levels, and seamless integration with data-flow parallel programming environments. It is inspired by loop distribution and supports nested, structured partitioning along the hierarchy of control dependences. The method relies on a data-flow streaming extension of OpenMP. These advances are made possible by progress in compiler intermediate representations: we describe our usage of the Static Single Assignment (SSA) form, show how we extend it to the context of concurrent streaming tasks, and discuss the benefits and challenges for PS-DSWP.
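
    The following hedged illustration (ours, not the paper's algorithm) shows the loop-distribution intuition behind PS-DSWP: a loop mixing a sequential recurrence with independent work is split into a sequential stage and a parallel stage, with a stream carrying values between them.

        float heavy_work(float v);   /* hypothetical pure function */

        void psdswp_example(float *a, const float *b, int n)
        {
            float s[n];   /* conceptually the stream between the two stages */
            s[0] = b[0];
            for (int i = 1; i < n; i++)    /* stage 1: sequential recurrence */
                s[i] = s[i - 1] + b[i];
            for (int i = 0; i < n; i++)    /* stage 2: parallel stage, since
                                              its iterations are independent */
                a[i] = heavy_work(s[i]);
        }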

    Automatic Detection of Performance Anomalies in Task-Parallel Programs

    Full text link
    To efficiently exploit the resources of new many-core architectures, integrating dozens or even hundreds of cores per chip, parallel programming models have evolved to expose massive amounts of parallelism, often in the form of fine-grained tasks. Task-parallel languages, such as OpenStream, X10, Habanero Java and C, or StarSs, simplify the development of applications for new architectures, but tuning task-parallel applications remains a major challenge. Performance bottlenecks can occur at any level of the implementation, from the algorithmic level (e.g., lack of parallelism or over-synchronization), to interactions with the operating and runtime systems (e.g., data placement on NUMA architectures), to inefficient use of the hardware (e.g., frequent cache misses or misaligned memory accesses); detecting such issues and determining their exact cause is a difficult task. In previous work, we developed Aftermath, an interactive tool for trace-based performance analysis and debugging of task-parallel programs and run-time systems. In contrast to other trace-based analysis tools, such as Paraver or Vampir, Aftermath offers native support for tasks, i.e., visualization, statistics and analysis tools adapted for performance debugging at task granularity. However, the tool does not yet support the automatic detection of performance bottlenecks, and it is up to the user to investigate the relevant aspects of program execution by focusing the inspection on specific slices of a trace file. In this paper, we present ongoing work on two extensions that guide the user through this process. (Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014), arXiv:1405.2281.)
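
    As a generic illustration of one simple detection strategy (not necessarily what Aftermath implements), task instances can be flagged when their duration deviates strongly from the mean duration of their task type:

        #include <math.h>
        #include <stddef.h>

        /* Returns 1 if duration d lies more than k standard deviations
         * from the mean of the n durations observed for this task type. */
        int is_anomalous(const double *samples, size_t n, double d, double k)
        {
            double mean = 0.0, var = 0.0;
            for (size_t i = 0; i < n; i++)
                mean += samples[i];
            mean /= (double)n;
            for (size_t i = 0; i < n; i++)
                var += (samples[i] - mean) * (samples[i] - mean);
            var /= (double)n;
            return fabs(d - mean) > k * sqrt(var);
        }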

    Design of Graphite and the Polyhedral Compilation Package

    No full text
    Graphite is the loop transformation framework introduced in GCC 4.4. This paper gives a detailed description of the design and future directions of this infrastructure. Graphite uses the polyhedral model as its internal representation (GPOLY). The plan is to create a polyhedral compilation package (PCP) that will provide loop optimization and analysis capabilities to GCC. This package will be separated from GIMPLE via an interface language restricted to expressing only what GPOLY can represent. The interface language is a set of data structures that encodes the control flow and memory accesses of a code region. A syntax for the language is also defined to facilitate debugging and testing.
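
    As a loose sketch of the kind of information such an interface language must encode (all names here are hypothetical, not the actual PCP API), a memory access like A[2*i + j + 3] in a depth-2 loop nest reduces to an affine subscript function over the iteration domain:

        struct pcp_affine_expr {
            int  num_dims;        /* loop depth of the enclosing statement */
            int *coefficients;    /* one coefficient per loop iterator     */
            int  constant;        /* constant offset                       */
        };

        struct pcp_memory_access {
            const char            *array_name;  /* base array being accessed */
            int                    is_write;    /* 0 = read, 1 = write       */
            struct pcp_affine_expr subscript;   /* e.g., 2*i + 1*j + 3       */
        };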

    Expressiveness and Data-Flow Compilation of OpenMP Streaming Programs

    Get PDF
    We present a data-flow extension of OpenMP to express highly dynamic control and data flow over nested, dependent tasks. The language supports dynamic creation, modular composition, variable and unbounded sets of producers/consumers, separate compilation, and first-class streams. These features, enabled by our original compilation flow, allow translating high-level parallel programming patterns, like dependences arising from StarSs' array regions, as well as universal low-level primitives like futures. In particular, these dynamic features can be embedded efficiently and naturally into an unmanaged imperative language, avoiding the complexity and overhead of a concurrent garbage collector. We demonstrate the performance advantages of a data-flow execution model compared to more restricted task and barrier models, as well as the efficiency of our compilation and runtime algorithms in supporting the complex dependence patterns arising from StarSs benchmarks.
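
    One way to picture the translation of futures, sketched below with illustrative clauses (ours, not the paper's exact syntax), is as a stream carrying a single value: the producing task fulfills the future and the consuming task blocks until the value is available.

        int f;                 /* conceptually a one-element stream */
        int  compute(void);    /* hypothetical producer function    */
        void use(int v);       /* hypothetical consumer function    */

        void future_example(void)
        {
            #pragma omp task output(f)   /* 'async': fulfill the future */
            f = compute();

            #pragma omp task input(f)    /* 'get': waits until f exists */
            use(f);
        }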

    A formal method for rule analysis and validation in distributed data aggregation service

    Get PDF
    The usage of cloud services has increased rapidly in recent years. The data management systems behind any cloud service are a major concern with respect to scalability, flexibility and reliability, as they are implemented in a distributed way. A Distributed Data Aggregation Service relying on a storage system meets these demands and serves as a repository back-end for complex analysis and automatic mining of any type of data. In this paper we continue our previous work on data management in cloud storage. We present a formal approach to expressing retrieval and aggregation rules with a compact yet powerful tool called Rule Markup Language. Our extended solution proposes a standard form for schemas and uses the tool to match the rules against the XML form of the structured data in order to obtain the unstructured entries from the BlobSeer data storage system. This allows the Distributed Data Aggregation Service (DDAS) to bypass several steps when processing a retrieval request. Our new architecture is more loosely coupled, with the new tool implemented as a separate module.

    Erbium: A Deterministic, Concurrent Intermediate Representation for Portable and Scalable Performance

    Get PDF
    Optimizing compilers and runtime libraries do not shield programmers from the complexity of multi-core hardware; as a result, the need for manual, target-specific optimizations increases with every processor generation. High-level languages are being designed to express concurrency and locality without reference to a particular architecture. But compiling such abstractions into efficient code requires a portable intermediate representation: this is essential for modular composition (separate compilation), for optimization frameworks independent of the source language, and for just-in-time compilation of bytecode languages. This paper introduces Erbium, an intermediate representation for compilers, a low-level language for efficiency programmers, and a lightweight runtime implementation. It relies on a data structure for scalable and deterministic concurrency, called the Event Record, exposing the data-level, task and pipeline parallelism suitable to a given target. We provide experimental evidence of the productivity, scalability and efficiency advantages of Erbium, relying on a prototype implementation in GCC 4.3.
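
    A much-simplified, hedged sketch of what an event record suggests (field names are hypothetical, not Erbium's actual interface): producer and consumer share a bounded window over a stream and synchronize only through monotonically advancing positions, which keeps execution deterministic without locks.

        #include <stdint.h>

        #define HORIZON 1024   /* bounded window of live stream elements */

        struct event_record {
            int      data[HORIZON];  /* element i stored at data[i % HORIZON] */
            uint64_t produced;       /* first index not yet written           */
            uint64_t consumed;       /* first index not yet read              */
        };

        /* The producer may write index i once i < consumed + HORIZON;
         * the consumer may read index i once i < produced.  Each index
         * is written exactly once, making the schedule deterministic. */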

    Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

    Get PDF
    Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best-effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system, together with placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and up to 5.6× higher than static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot keep up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.
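
    As a hedged sketch of the flavor of such a heuristic (all names are hypothetical, not the paper's implementation): given the NUMA node holding each input dependence, the task is scheduled on the node owning the most input bytes, and its output buffers are allocated locally there, making output accesses local by construction.

        #include <stddef.h>

        #define MAX_NODES 24   /* hypothetical node count for this sketch */

        struct dep { int node; size_t bytes; };  /* where one input lives */

        int choose_node(const struct dep *inputs, int n_inputs)
        {
            size_t weight[MAX_NODES] = {0};
            int best = 0;
            for (int i = 0; i < n_inputs; i++)
                weight[inputs[i].node] += inputs[i].bytes;
            for (int node = 1; node < MAX_NODES; node++)
                if (weight[node] > weight[best])
                    best = node;
            return best;  /* run the task on a core of this node */
        }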